[step 1] refine code to support all devices in torch and hot fix for gemma4-unified #1879
wenhuach21 wants to merge 43 commits into
for more information, see https://pre-commit.ci
Pull request overview
This PR refactors device handling by introducing a unified DeviceManager abstraction and updating existing utilities to use it, with the intent of reducing scattered backend-specific (cuda/xpu/hpu) branching across the codebase.
Changes:
- Added `auto_round/utils/device_manager.py` to centralize backend discovery and runtime ops (sync/cache/memory queries).
- Updated `auto_round/utils/device.py` to route device counting, selection, memory clearing, and memory queries through the new device manager APIs.
- Updated `auto_round/auto_scheme/delta_loss.py` to synchronize via the active device manager and broaden "non-CPU device" checks.
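As a rough illustration of the kind of abstraction described above, the following minimal sketch resolves one active backend and dispatches runtime ops to the matching `torch.<backend>` module. The class name `SketchDeviceManager`, its methods, and the backend priority order are hypothetical and are not taken from the PR's `device_manager.py`; the sketch only assumes that each backend module, when present, exposes `is_available()`, `synchronize()`, `empty_cache()`, and `device_count()`, as `torch.cuda` does.

```python
try:
    import torch
except ImportError:  # keeps the sketch importable on machines without torch
    torch = None

# Hypothetical priority order; the PR may discover backends differently.
_BACKENDS = ("cuda", "xpu", "hpu")


class SketchDeviceManager:
    """Hypothetical stand-in for a unified device manager (not the PR's class)."""

    def __init__(self, torch_module=torch, backends=_BACKENDS):
        self._torch = torch_module
        self.backend = "cpu"  # fallback when no accelerator backend is usable
        for name in backends:
            module = getattr(torch_module, name, None)
            is_available = getattr(module, "is_available", None)
            if callable(is_available) and is_available():
                self.backend = name
                break

    def _call(self, op, default=None):
        # Look up torch.<backend>.<op>; skip silently if the backend lacks it.
        module = getattr(self._torch, self.backend, None)
        fn = getattr(module, op, None)
        return fn() if callable(fn) else default

    def synchronize(self):
        self._call("synchronize")

    def empty_cache(self):
        self._call("empty_cache")

    def device_count(self):
        return self._call("device_count", default=0)


manager = SketchDeviceManager()
manager.synchronize()  # one call site instead of per-backend if/elif branches
```

Callers then write `manager.synchronize()` once rather than branching on cuda/xpu/hpu, which is the reduction in scattered backend checks that the overview describes.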
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| auto_round/utils/device.py | Replaces backend-specific device/memory logic with DeviceManager calls. |
| auto_round/utils/device_manager.py | New unified device backend abstraction (discovery + runtime + memory APIs). |
| auto_round/auto_scheme/delta_loss.py | Uses DeviceManager for synchronization and generalized non-CPU device checks. |
/azp run Unit-Test-CUDA-AutoRound
Azure Pipelines successfully started running 1 pipeline(s).
…into refine_device_1 # Conflicts: # auto_round/utils/device_manager.py
…into refine_device_1
…into refine_device_1 Signed-off-by: Wenhua Cheng <wenhua.cheng@intel.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Some functions have not been refined yet. I would prefer to merge this PR first and refine them later to avoid conflicts. Please review when you are free.
    return pipe, model


    _PRE_DEFINED_FIXED_ATTR = {"gemma4_unified": {"has_variable_block_shape": True}}
Setting `has_variable_block_shape = True` makes every block cache its own inputs,
which increases VRAM usage by approximately N× (N = total number of blocks).
The original Gemma4 approach instead uses a per-layer forward monkey-patch
to rebuild `position_embeddings` dynamically at replay time (zero extra cache),
so this change is a significant memory trade-off.
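To put the N× estimate in concrete terms, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that the cached input of every block is a single hidden-state tensor of the same shape and that a shared cache holds only one such copy at a time; the function name and all numbers are hypothetical and are not measurements from this PR.

```python
def cached_input_bytes(num_blocks, num_samples, seq_len, hidden_size,
                       bytes_per_elem=2, per_block_cache=False):
    """Rough size in bytes of cached block inputs (illustrative estimate only)."""
    one_copy = num_samples * seq_len * hidden_size * bytes_per_elem
    # Shared cache: a single copy is kept and reused as blocks are processed.
    # Per-block cache: every block keeps its own copy of its inputs.
    copies = num_blocks if per_block_cache else 1
    return one_copy * copies


# Hypothetical calibration setup: 32 blocks, 128 samples, 2048 tokens, bf16.
setup = dict(num_blocks=32, num_samples=128, seq_len=2048, hidden_size=4096)
shared = cached_input_bytes(**setup)
per_block = cached_input_bytes(**setup, per_block_cache=True)
ratio = per_block // shared  # equals num_blocks under these assumptions
print(f"shared: {shared / 2**30:.1f} GiB, per-block: {per_block / 2**30:.1f} GiB")
```

Under these assumptions the ratio is exactly the block count; real usage would differ if blocks cache inputs of different shapes or extra tensors.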
Description
Please briefly describe your main changes, the motivation.
Type of Change
Bug fix
Related Issues
Fixes or relates to #
Checklist Before Submitting
/azp run Unit-Test-CUDA-AutoRound.